[Prototype] Integration with PipelineRL: streaming dataset and trainer events with weight broadcasting #389
Conversation
…M into denis/new_datasets
super()._validate()
rank, global_ranks = self._get_model_and_sequence_data_rank_and_global_ranks()
We don't care about PP, so this is the same as tensor_and_sequence_data, right?
Yes, but you cannot easily calculate global ranks for tensor_and_sequence_data in some configuration
variations, right? In that case, calculating them via simulation might actually be a good approach, even for
tensor_and_sequence_data.
Also, are we dropping pipeline parallelism entirely? If not, would it be better to use model_and_sequence_data?
The ranks for tensor_and_sequence_data are easy to calculate https://github.com/ServiceNow/Fast-LLM/pull/389/files#diff-c76c17be20bd8a5658b40b2b9301c3fce5d6c06cf5341a8767f9339e36378f90R358. model_and_sequence_data has a slightly more complicated computation because it has two strides, but let's not worry about it.
I've seen your latest change in the distributed setup. Thanks, I agree it's possible. However, if we want to change the arrangement, we'll need to modify the ranks-and-strides function again. With the simulation, you only change the assigned ranks, and the groups are created automatically.
Why would we want to change something? For example, I saw in one configuration that we have more than one data group on the same node, while its tensor and pipeline parallel parts are placed on another node. As a result, we end up with two streaming dataset readers on one node and none on another.
Ideally, we would want to have one data group per node, with its TP and PP parts on the same node as much as possible. At the very least, we should avoid having two data-group leaders on the same node. What do you think?
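The two layouts being compared above can be sketched as plain rank arithmetic. This is an illustrative sketch only: the function names are hypothetical, and the assumption that tensor_and_sequence_data needs a single stride while model_and_sequence_data needs two (because pipeline ranks add an outer stride) is taken from the comments in this thread, not from Fast-LLM's actual implementation.

```python
def single_stride_group(first: int, size: int, stride: int) -> list[int]:
    # One-stride layout, e.g. tensor_and_sequence_data:
    # group members are evenly spaced global ranks.
    return [first + i * stride for i in range(size)]


def two_stride_group(
    first: int,
    inner_size: int,
    inner_stride: int,
    outer_size: int,
    outer_stride: int,
) -> list[int]:
    # Two-stride layout, e.g. model_and_sequence_data: an inner
    # (tensor/sequence-data) stride nested inside an outer
    # (pipeline) stride.
    return [
        first + o * outer_stride + i * inner_stride
        for o in range(outer_size)
        for i in range(inner_size)
    ]
```

For example, with 2 pipeline stages spaced 8 ranks apart and 2 inner ranks each, `two_stride_group(0, 2, 1, 2, 8)` yields `[0, 1, 8, 9]`, which is the "slightly more complicated computation" mentioned above.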
@@ -0,0 +1,105 @@
import logging

import orjson
Any good reason why we need this over the stock json?
It is used in PipelineRL because it is much faster; I decided to use the same library on our side.
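For context, the main practical difference is that `orjson.dumps` returns `bytes` rather than `str`. A small sketch of keeping the two interchangeable (the wrapper names `dumps`/`loads` are hypothetical, not from this PR):

```python
import json

try:
    import orjson  # C-accelerated JSON, used by PipelineRL

    def dumps(obj) -> bytes:
        # orjson serializes directly to bytes.
        return orjson.dumps(obj)

    def loads(data):
        return orjson.loads(data)

except ImportError:
    # Stdlib fallback with the same round-trip semantics, just slower.
    def dumps(obj) -> bytes:
        return json.dumps(obj).encode()

    def loads(data):
        return json.loads(data)


record = {"event": "weights_ready", "step": 100}
assert loads(dumps(record)) == record
```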
Superseded by #428: refactoring and simplification of the prototype, along with changes to distributed code and tests.
✨ Description
This PR provides the initial integration with PipelineRL.
It introduces:
- Trainer events `training_finished`, `initial_weights_step` and `weights_ready`, broadcast over a dedicated Redis channel.

This enables seamless coordination between Fast-LLM training and PipelineRL-based inference or orchestration components.
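The event mechanism can be sketched as a small publisher. The event names come from this PR's description, but the channel name, payload schema, and helper functions below are assumptions for illustration, not Fast-LLM's actual API:

```python
import json

# Event names taken from the PR description; the payload layout is assumed.
TRAINER_EVENTS = ("training_finished", "initial_weights_step", "weights_ready")


def make_event(name: str, step: int) -> bytes:
    # Build the serialized message for one trainer event.
    if name not in TRAINER_EVENTS:
        raise ValueError(f"unknown trainer event: {name}")
    return json.dumps({"event": name, "step": step}).encode()


def publish_event(channel: str, name: str, step: int) -> None:
    # redis-py's publish() delivers the message to all current
    # subscribers of the channel (fire-and-forget pub/sub).
    import redis  # imported lazily so the sketch loads without redis-py

    redis.Redis().publish(channel, make_event(name, step))
```

A subscriber on the PipelineRL side would SUBSCRIBE to the same channel and, for example, trigger a weight reload when it sees `weights_ready`.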
Closes #
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.